Skip to content

Fix Database Installation Path and Prevent Duplicate Database#36

Merged
calladoum-elastic merged 5 commits into
elastic:mainfrom
chengyongru:fix-dbpath
Feb 12, 2026
Merged

Fix Database Installation Path and Prevent Duplicate Database#36
calladoum-elastic merged 5 commits into
elastic:mainfrom
chengyongru:fix-dbpath

Conversation

@chengyongru

Copy link
Copy Markdown
Contributor

Summary

This PR fixes two database installation issues:

  1. Database was incorrectly installed at die/db/db/ instead of die/db/
  2. die_library was installing a duplicate database to site-packages/db (Windows only)

Changes

1. Fix Database Installation Path

  • CMakeLists.txt: Fixed database installation path
    • Changed DESTINATION die/db to DESTINATION die for db and db_custom
    • Added ICU DLL installation patterns for Windows

2. Prevent Duplicate Database Installation

  • python/CMakeLists.txt: Added workaround to remove incorrectly installed db
    • die_library on Windows installs db to site-packages/db
    • Added install(CODE) to remove this duplicate before our correct installation
    • Only affects Windows platform

3. Backward Compatibility

  • init.py: Added _DatabasePath class for backward compatibility
    • Automatically detects old (die/db/db/) vs new (die/db/) structure
    • Handles both old and new usage patterns with deprecation warning

4. Tests and Documentation

  • test_die.py: Added comprehensive database path tests
  • test_regression.py: Updated to use corrected database path
  • README.md: Updated usage examples to show correct database path

Backward Compatibility

Fully backward compatible - Existing code using database_path / "db" will continue to work with a deprecation warning.

New code can use the simpler database_path directly:

# Before (workaround for the bug)
res = die.scan_file(file, flags, str(die.database_path / "db"))

# After (correct usage)
res = die.scan_file(file, flags, str(die.database_path))

Impact

Size Reduction

  • Before: 18.50 MB (duplicate databases)
  • After: 12.01 MB (single database)
  • Reduction: ~6.5 MB (~30% smaller wheel)

Installation Structure

  • Fixes database installation to correct location: die/db/
  • Removes duplicate database from site-packages/db
  • No breaking changes to existing code

Testing

All tests pass with both old and new database structures:

  • Existing installations with die/db/db/ will work normally
  • New installations with die/db/ will work correctly
  • The _DatabasePath class automatically handles both cases

@cla-checker-service

cla-checker-service Bot commented Jan 29, 2026

Copy link
Copy Markdown

💚 CLA has been signed

@calladoum-elastic

Copy link
Copy Markdown
Collaborator

@chengyongru Can you sign the CLA (see bot comment)? The PR cannot be merged until then

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes how the Detect-It-Easy (DIE) signature database is packaged/installed for the Python bindings, addressing an incorrect nested install path and a Windows-only duplicate database install, while adding backward-compatible path handling and tests.

Changes:

  • Correct CMake install destinations so the database lands in die/db/ (not die/db/db/) and add Windows-specific cleanup + ICU DLL install patterns.
  • Add a backward-compatibility shim for die.database_path so both new (database_path) and old (database_path / "db") usage work.
  • Update tests and README examples to use the corrected database path.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
python/CMakeLists.txt Fixes database install destination; adds Windows workaround to remove wrongly installed top-level db; extends Windows DLL install patterns.
python/die/init.py Introduces _DatabasePath compatibility layer and updates database_path behavior/documentation.
python/src/die.cpp Bumps native module __version__.
python/tests/test_die.py Updates assertions for new database path behavior and adds new database path tests.
python/tests/test_regression.py Updates regression test to use the corrected database path and renames the test to match issue 28.
README.md Updates usage examples and paths to reflect the corrected database location.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread python/tests/test_die.py
Comment on lines 21 to 23
# validate die database
assert isinstance(die.database_path, pathlib.Path)
assert isinstance(die.database_path, die._DatabasePath)
assert die.database_path.exists()

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion now couples the public API to the private _DatabasePath implementation. If _DatabasePath is meant to be an internal compatibility shim, tests should validate the public contract instead (e.g., os.fspath(die.database_path) returns an existing directory, and it behaves like a PathLike) rather than asserting the internal type.

Copilot uses AI. Check for mistakes.
Comment thread python/tests/test_die.py Outdated
Comment on lines +213 to +214
assert (db_path / dir_name).exists() or (db_path / dir_name).exists(), \
f"Expected directory {dir_name} not found at {db_path}"

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion uses the same expression on both sides of the or ((db_path / dir_name).exists() or (db_path / dir_name).exists()), which makes the or redundant and suggests the test isn't checking what was intended. If the goal is to allow both layouts, consider checking the alternate location explicitly (e.g., nested db/), or just assert on the resolved path only.

Suggested change
assert (db_path / dir_name).exists() or (db_path / dir_name).exists(), \
f"Expected directory {dir_name} not found at {db_path}"
# Allow both new layout (db_path/dir_name) and legacy layout (db_path/db/dir_name)
assert (db_path / dir_name).exists() or (db_path / 'db' / dir_name).exists(), \
f"Expected directory {dir_name} not found at {db_path} (or legacy db/ layout)"

Copilot uses AI. Check for mistakes.
Comment thread python/die/__init__.py Outdated
DeprecationWarning,
stacklevel=2
)
return base_path

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__truediv__ returns a pathlib.Path instance here (return base_path), while other divisions return a _DatabasePath/PurePath. This inconsistent return type can surprise callers and makes path chaining behave differently depending on the segment. Prefer returning the same path type consistently (e.g., return a _DatabasePath pointing at the resolved base path) while still emitting the deprecation warning.

Suggested change
return base_path
# Return a _DatabasePath (or subclass) pointing at the base path
return self.__class__(base_path_str)

Copilot uses AI. Check for mistakes.
Comment thread python/tests/test_die.py
Comment on lines +224 to +228
res = die.scan_file(
target_binary,
die.ScanFlags.DEEP_SCAN,
database=str(die.database_path)
)

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These multi-line scan_file calls are not formatted consistently with the rest of the file (missing trailing comma on the last argument, so Black would reformat it). Please add trailing commas so formatting is stable and consistent with existing calls above.

Copilot uses AI. Check for mistakes.
Comment thread python/CMakeLists.txt Outdated
Comment on lines +74 to +78
install(CODE [[
set(WRONG_DB_DIR "${CMAKE_INSTALL_PREFIX}/db")
if(EXISTS "${WRONG_DB_DIR}")
execute_process(COMMAND ${CMAKE_COMMAND} -E remove_directory "${WRONG_DB_DIR}")
message(STATUS "Removed incorrectly installed database: ${WRONG_DB_DIR}")

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This install-time removal deletes ${CMAKE_INSTALL_PREFIX}/db if it exists. Even if this is intended for wheel staging, it is very broad and could remove an unrelated db directory when installing to a non-staging prefix. Consider narrowing the deletion by verifying the directory contains DIE signature markers (e.g., PE/) and/or removing only the exact path that die_library installs into, to reduce the risk of unintended data loss.

Copilot uses AI. Check for mistakes.
Comment thread python/die/__init__.py Outdated
database_path = pathlib.Path(__path__[0]) / "db"
"""Path to the DIE signature database"""

class _DatabasePath(pathlib.PurePosixPath):

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

database_path used to be a pathlib.Path, but _DatabasePath currently subclasses pathlib.PurePosixPath, which is not a concrete filesystem path type and breaks isinstance(die.database_path, pathlib.Path) for existing callers. Consider instead subclassing the platform-specific concrete path type (e.g., type(pathlib.Path())) or making database_path a normal pathlib.Path and adding a separate helper/wrapper for the backward-compat resolution, so the public API type remains stable.

Suggested change
class _DatabasePath(pathlib.PurePosixPath):
_BasePath = type(pathlib.Path())
class _DatabasePath(_BasePath):

Copilot uses AI. Check for mistakes.
Comment thread python/die/__init__.py Outdated
Comment on lines +108 to +116
This path automatically points to the correct database location:
- In new versions (0.6.0+): directly at die/db/
- In old versions (0.5.x): at die/db/db/

Usage:
# New code (recommended):
die.scan_file(file, flags, str(die.database_path))

# Old code (still works, but shows deprecation warning):

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring claims the fixed layout applies to "new versions (0.6.0+)", but this PR bumps the extension module to 0.5.1 and the packaging metadata appears to still be 0.5.x. Please align these version references (or remove explicit version numbers) so the documentation matches the actual release/versioning used by the package.

Suggested change
This path automatically points to the correct database location:
- In new versions (0.6.0+): directly at die/db/
- In old versions (0.5.x): at die/db/db/
Usage:
# New code (recommended):
die.scan_file(file, flags, str(die.database_path))
# Old code (still works, but shows deprecation warning):
This path automatically points to the correct database location,
regardless of how the package is laid out:
- When the database directory is installed directly at die/db/
- When the database directory is installed at die/db/db/
Usage:
# Recommended:
die.scan_file(file, flags, str(die.database_path))
# Legacy code (still works, but may show a deprecation warning):

Copilot uses AI. Check for mistakes.
Comment thread python/src/die.cpp

m.doc() = "The native `die` module";
m.attr("__version__") = "0.5.0";
m.attr("__version__") = "0.5.1";

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m.attr("__version__") is bumped to 0.5.1 here, but the build/package versions still appear to be 0.5.0 (e.g., pyproject.toml and python/CMakeLists.txt). This will lead to inconsistent version reporting (wheel metadata vs die.__version__). Please update all version sources in the repo together or derive the extension version from the project/package version to keep them in sync.

Suggested change
m.attr("__version__") = "0.5.1";
m.attr("__version__") = DIE_VERSION;

Copilot uses AI. Check for mistakes.
Comment thread python/tests/test_die.py Outdated
Comment on lines +184 to +188
# Check if this is the new fixed version (database at die/db/PE/)
# by checking if base path contains PE/
base_path = pathlib.Path(str(die.database_path).replace('\\db\\db', '\\db\\'))
# Note: We can't easily check base_path, so we use warning presence

Copilot AI Feb 9, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base_path is computed but never used, and the preceding comment says you "can't easily check" it. This is dead code that can be removed to keep the test logic focused and avoid confusing future readers.

Suggested change
# Check if this is the new fixed version (database at die/db/PE/)
# by checking if base path contains PE/
base_path = pathlib.Path(str(die.database_path).replace('\\db\\db', '\\db\\'))
# Note: We can't easily check base_path, so we use warning presence

Copilot uses AI. Check for mistakes.
@chengyongru

Copy link
Copy Markdown
Contributor Author

@calladoum-elastic Ok, I've signed the CLA but it looks like the status hasn't been refreshed yet.

@chengyongru

Copy link
Copy Markdown
Contributor Author

@calladoum-elastic
I've incorporated the relevant Copilot feedback and significantly optimized the distribution by stripping unnecessary files—the final WHL size is now 9.30 MB. Verified that the CI suite passes.

I'm currently blocked by the CLA bot. Despite having signed the agreement, my status isn't being recognized. Could you please help trigger a manual re-check or advise on the next steps?

@calladoum-elastic

Copy link
Copy Markdown
Collaborator

Hi! I will try to see what is happening.

@calladoum-elastic

Copy link
Copy Markdown
Collaborator

Hi @chengyongru

I'm currently blocked by the CLA bot. Despite having signed the agreement, my status isn't being recognized. Could you please help trigger a manual re-check or advise on the next steps?

The easiest fix would be to update your ~/.gitconfig with the email provided to the CLA, and amend your commits to match this email. The validation should work then.

Thanks!

## Problem
The database was incorrectly installed at die/db/db/ due to CMake
install directive using DESTINATION die/db for the db directory,
which created a double-nested structure.

## Changes
- CMakeLists.txt: Fixed database installation path
  - Changed DESTINATION die/db to DESTINATION die for db and db_custom
  - Added ICU DLL installation patterns for Windows

- __init__.py: Added _DatabasePath class for backward compatibility
  - Automatically detects old (die/db/db) vs new (die/db) structure
  - Handles both old and new usage patterns with deprecation warning
  - Provides seamless transition for existing code

- Tests: Updated to work with both old and new database paths
  - test_die.py: Added comprehensive database path tests
  - test_regression.py: Updated to use corrected database path

- README.md: Updated usage examples
  - Changed database_path/'db' to database_path (new recommended usage)
  - Updated database path examples to show correct die/db/ structure

## Impact
- Fixes database installation to correct location: die/db/
- Maintains backward compatibility with old code using database_path / "db"
- Users can upgrade without breaking existing code
die_library's CMakeLists.txt on Windows installs the database to
CMAKE_INSTALL_PREFIX/db (site-packages/db), while die-python's
CMakeLists.txt correctly installs it to die/db. This caused:
1. Duplicate database installation (wasted ~6.5MB)
2. Incorrect db location in site-packages root

Added install(CODE) workaround in python/CMakeLists.txt to remove the
incorrectly installed db directory before our correct installation.

- Wheel size reduced from 18.50MB to 12.01MB (~30% reduction)
- Database only installed in correct location: die/db/
- No duplicate database directories
- Use concrete Path type for _DatabasePath to maintain isinstance() compatibility
- Remove duplicate OR expression in test assertion
- Bump version to 0.5.1 across all files
- Remove version numbers from database path docstring
- Remove unused base_path variable in test
- Add trailing commas for multi-line function calls
die_library's CMakeLists.txt installs several files to incorrect locations
that should not be included in the Python wheel:

1. db/ directory - installed to site-packages/db, should only be in die/db
2. die.lib - installed to root directory, should only be in die/die.lib
3. include/die.h - C++ headers not needed in Python wheel package

This commit uses a loop-based install(CODE) approach to remove these files
during the installation process, keeping the wheel package clean and minimal.

whl size: 9.30MB
@chengyongru

Copy link
Copy Markdown
Contributor Author

@calladoum-elastic I just tried changing the email, it seems there is no problem now.

When path operations like / operator create new _DatabasePath instances,
they may bypass __new__ in some Python versions, causing _resolved_path_str
to not be initialized. This triggers AttributeError when _get_resolved_str()
tries to access the missing attribute.

Solution: Use getattr() with a default value in _get_resolved_str() to
handle cases where __new__ wasn't called, making the code robust across
different Python versions and pathlib implementations.

Changes:
- Use getattr(self, '_resolved_path_str', None) instead of direct access
- Keep __truediv__ simple - only handle business logic
- Remove redundant comments that just repeat what the code says

Fixes AttributeError: '_DatabasePath' object has no attribute '_resolved_path_str'

References:
- https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.with_segments
- python/cpython#100479
@chengyongru

Copy link
Copy Markdown
Contributor Author

@calladoum-elastic

I fixed _DatabasePath AttributeError from CI failure

Noticed the test failure at:

https://github.com/elastic/die-python/actions/runs/21890730972

@calladoum-elastic

Copy link
Copy Markdown
Collaborator

Since macos-13 will be removed in a coming PR, all tests are passing.

@calladoum-elastic calladoum-elastic merged commit fcb41c1 into elastic:main Feb 12, 2026
18 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants